See “Assignment Guidelines” under Pages in CourseWorks for instructions that apply to all assignments. (The guidelines will grow as the semester progresses so check back for each new assignment.)
For tidycensus you will need to set up a Census API key. After calling the API and getting the data, save it locally and comment out the API call lines. Work with your saved local copy to avoid making repeated calls for the same data, to save time waiting for the data, and to be able to work without an internet connection. Do not share your API key. You may store it locally in .Renviron.
The only packages you need for this assignment in addition to tidyverse packages (not listed below) are sf, tidycensus, ggdensity, and tmap.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 4.0.0 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidycensus)library(sf)
Linking to GEOS 3.13.0, GDAL 3.8.5, PROJ 9.5.1; sf_use_s2() is TRUE
library(ggdensity)library(tmap)#used for filtering datalibrary(dplyr)#histogramlibrary(ggplot2)tmap_mode("plot")
ℹ tmap modes "plot" - "view"
ℹ toggle with `tmap::ttm()`
#set up census API Key (run once)#census_api_key("626e65050644fc48d16fb966a5e56a3c2f1f95ac", install = TRUE)
1. Westchester County Towns and School Districts
[12 points]
Use sf and tmap (version 4 code).
Obtain shape file data from https://gis.westchestergov.com/ to draw side-by-side maps of Westchester municipal boundaries and Westchester school district boundaries. Label the regions on the maps.
municipal <-"Municipal_Boundaries_/Municipal_Boundaries_.shp"school_districts <-"School_Districts/School_Districts.shp"muni <-st_read(municipal, quiet =TRUE)sd <-st_read(school_districts, quiet =TRUE)#sd <- st_transform(sd, st_crs(muni))#crs standardizationsd <-st_transform(sd, st_crs(muni))#municipal boundariesmap_muni <-tm_shape(muni) +tm_polygons(fill ="lightblue2", col ="steelblue", lwd =0.6) +tm_text("NAME", size =0.5, fontface ="bold",col ="black", options =opt_tm_text(point.label =TRUE)) +tm_title("Municipal Boundaries (Westchester)") #school district boundariesmap_sd <-tm_shape(sd) +tm_polygons(fill ="lightblue2", col ="steelblue", lwd =0.6) +tm_text("DISTNAME", size =0.5,fontface ="bold",col ="black", options =opt_tm_text(point.label =TRUE)) +tm_title("School District Boundaries (Westchester)") #Display maps side by sidetmap_arrange(map_muni, map_sd, ncol =2)
What do you observe?
School district boundaries do not always align with municipal boundaries with a municipality crossing into several school districts and vice versa. For example the Bedford school district also includes the Pound Ridge municipality in addition to the Bedford municipality for example. The municipal and district boundaries are often quite different, but there are some where theere is a similar shape like in the bottom left.
Add the Bedford school district as an additional layer on the municipal boundaries map.
names(sd)
[1] "OBJECTID" "DISTNAME" "MUNSINCL" "geometry"
#Filter Bedford school districtbedford_sd <- sd |>filter(grepl("Bedford", DISTNAME, ignore.case =TRUE))tm_shape(muni) +tm_polygons(fill ="lightblue2",col ="steelblue",lwd =0.6) +tm_text("NAME",size =0.5,fontface ="bold",col ="darkblue",options =opt_tm_text(point.label =TRUE) ) +tm_shape(bedford_sd) +tm_polygons(fill ="red",col ="red",fill_alpha =0.3,lwd =2 ) +tm_text(text =paste0(bedford_sd$DISTNAME, " School Dist"),size =0.6,fontface ="bold",col ="black",options =opt_tm_text(point.label =TRUE) ) +tm_title("Bedford School District over Municipal Boundaries")
Which towns have areas included in the Bedford school district?
Visually: Bedford, Mount Kisco, New Castle, North Castle, Pound Ridge
Lewisboro and Somers are not as visible visually & touch on the borders of the Bedford School District so could be included as validated by the code (shown below):
#VERIFYING VIA CODE# boolean overlaphits <-st_intersects(muni, bedford_sd, sparse =FALSE)[,1]# list the town namessort(unique(muni$NAME[hits]))
Available geographies: - us - region - division - state - county - county subdivision - tract - block group or cbg - block - place - alaska native regional corporation - american indian area/alaska native area/hawaiian home land - american indian area/alaska native area (reservation or statistical entity only) - american indian area (off-reservation trust land only)/hawaiian home land - metropolitan/micropolitan statistical area” (2021 5-year ACS and later) OR “metropolitan statistical area/micropolitan statistical area” OR “cbsa” - combined statistical area - new england city and town area - combined new england city and town area - urban area - congressional district - school district (elementary) - school district (secondary) - school district (unified) - public use microdata area - “zip code tabulation area” OR “zcta” - “state legislative district (upper chamber)” - “state legislative district (lower chamber)” - “voting district”
library(tidycensus)#set up census API Key (run once)#census_api_key("626e65050644fc48d16fb966a5e56a3c2f1f95ac", install = TRUE)
Get data for median household income (MHI) by census block group in Westchester County in 2023. Draw a histogram of MHI. What do you observe?
#Saving the data as an RDS locally# mhi <- get_acs(# geography = "block group",# variables = "B19013_001",# state = "NY",# county = "Westchester",# year = 2023,# geometry = TRUE# )# saveRDS(mhi, "westchester_mhi_2023.rds")# Reading saved RDS filemhi <-readRDS("westchester_mhi_2023.rds")#removing unreported/null values for histogram to avoid warning but leaving it for#chloropleth graphsggplot(mhi |>filter(is.finite(estimate)), aes(x = estimate)) +geom_histogram(fill ="lightblue", color ="darkblue", bins =40) +labs(title ="Distribution of Median Household Income (2023)",x ="Median Household Income ($)",y ="Number of Block Groups" )
The distribution shows evidence of topcoding where many blocks all report $250,000, this could be because they exceed 250,000 and therefore get classified into the $250,000 bucket. If we negate that bin we can see most of the blocks are symmetric (and could even be considered normal or right-skewed) and fall somewhere in the middle income range between the x-axis (from around $75,000 to $145,000) representing the middle class. The presence of NA (unreported incomes) in some block groups that were filtered out for this histogram should be investigated further.
Draw a choropleth map of MHI for Westchester County using the same data.
#Leaving missing data as per conversation with professor as it visualizes areas#with no data collected#choropleth (block group MHI)tm_shape(mhi) +tm_polygons(fill ="estimate", fill.scale =tm_scale_intervals( values ="brewer.blues" ),fill.legend =tm_legend(title ="Median Household Income ($)") ) +tm_title(text ="Westchester County: Median Household Income by Block Group (2023)" )
Add school district boundaries to the choropleth.
Using a spatial intersection to ensure that all school district boundaries are within Westchester County.
#sd <- get_acs(#geography = "school district (unified)",##population variable#variables = "B01003_001",#state = "NY",#year = 2023,#geometry = TRUE#)#save the data locally#saveRDS(sd, "sd_2023_unified.RDS")#read the saved file latersd <-readRDS("sd_2023_unified.RDS")#transform CRS to match your block group mapsd <-st_transform(sd, st_crs(mhi))#intersectionwestchester_boundary <-st_union(mhi)school_districts_westchester <-st_intersection(sd, westchester_boundary)
Warning: attribute variables are assumed to be spatially constant throughout
all geometries
tm_shape(mhi) +tm_polygons(fill ="estimate", fill.scale =tm_scale_intervals(values ="brewer.blues" ),fill.legend =tm_legend(title ="Median Household Income ($)") ) +tm_shape(school_districts_westchester) +tm_borders(col ="red", lwd =1) +tm_title("Median Household Income by Block Group with School District Boundaries (2023)")
#if we only want to include boundaries of school districts all staying within #Westchester county
Describe observed patterns across and within school districts.
Many school districts have clusters of higher income level block groups within them. Larger sized area blocks with high income are clustered around the northern area and right/east side of the map, showing high median household income and equivalently capturing school districts within this region of high income communities. There are some school districts with a few low and moderate income blocks mixed in with a high income blocks, particularly in the southernmost portion and left/west side of the county. This suggests there is some socioeconomic diversity within school district boundaries. However, income levels within a single district generally fall within a relatively adjacent range of income levels (colors), indicating that while variation exists, most households in a district tend to have comparable income levels overall.
3. Bachelors degrees
[14 points]
Use tidycensus (see NOTES above) and tmap (version 4 code).
Maps do not need to be labeled.
Get data for the total number of people over 25 years with bachelor’s degrees by census tract in Westchester County in 2023. Draw a choropleth map.
# bach <- get_acs(# geography = "tract",# Bachelor's degree# variables = "B15003_022", # state = "NY",# county = "Westchester",# year = 2023,# survey = "acs5",# geometry = TRUE# )tmap_options(component.autoscale =FALSE)#saveRDS(bach, "westchester_bachelors_tract_2023.rds")bach <-readRDS("westchester_bachelors_tract_2023.rds")#as per professors instructions to handle missing data we can make it #0 or filter it out later#because later we need to calcaulate the percentagesbach_clean <- bach # |> dplyr::filter(is.finite(estimate))#draw choropleth (tmap v4)tm_shape(bach_clean) +tm_polygons(#counts of adults 25+ with bachelor'sfill ="estimate", col ="gray40", lwd =0.1, fill.scale =tm_scale(values ="brewer.blues" ),fill.legend =tm_legend(title ="Adults 25+ with Bachelor's") ) +tm_borders(col ="black", lwd =0.4) +tm_title("Westchester County — Bachelor's Degree Holders (25+) \n Count by Tract")
The graph from part a) is potentially misleading because it doesn’t take population into account. Create a new measure for bachelor’s degrees as a percentage of the total population over 25 years and redraw using linear fill scale breaks.
For some areas there is no population (marked as missing in the dataset) and therefore no adults with Bachelor’s degrees. For these areas I modified the population from NA to 0 so that the percentage of adult’s with bachelor’s degrees is also 0 (as per my discussion with the professor).
#get denominator: total population 25+ (same geo/time as part a)# total25 <- get_acs(# geography = "tract",# variables = "B15003_001", # total pop 25+# state = "NY",# county = "Westchester",# year = 2023,# survey = "acs5",# geometry = FALSE # )# # #save locally# saveRDS(total25, "total25_2023_Westchester.RDS")#Later, when reopening project, load it from the saved file instead of calling get_acs again:total25 <-readRDS("total25_2023_Westchester.RDS")#compute percentage#make missing value into 0 as per instruction by professor#the regions with no population (NA population) will also have 0 people with #bachelors degrees bach_pct <- bach_clean |>select(GEOID, NAME, geometry, bach = estimate) |>left_join(total25 |>select(GEOID, total25 = estimate), by ="GEOID") |>mutate(bach =ifelse(is.na(bach), 0, bach), #replace missing bachelor's countstotal25 =ifelse(is.na(total25) | total25 ==0, 1, total25), pct_bach25 =100* bach / total25 )#choose equal-interval breaks once so you can reuse in (c)brks <-c(0, 20, 40, 60, 80, 100)labs <-c("0–<20", "20–<40", "40–<60", "60–<80", "80–100")#redraw choropleth with percentagestm_shape(bach_pct) +tm_polygons(fill ="pct_bach25",col ="gray40", lwd =0.1,fill.scale =tm_scale_intervals(breaks = brks, values ="brewer.blues",labels = labs ),fill.legend =tm_legend(title ="Adults 25+ with Bachelor's (%)") ) +tm_borders(col ="black", lwd =0.4) +tm_title("Westchester County % Bachelor's (Adults 25+) by Tract")
Add a histogram showing the distribution of the bins, using the same fill colors as in the map. What do you observe?
#bin into the same intervalshist_df <- bach_pct |>mutate(bin =cut(pct_bach25, breaks = brks, right =FALSE, include.lowest =TRUE))#match tmap colors (6 bins)pal6 <- cols4all::c4a("brewer.blues", 5)ggplot(hist_df, aes(x = bin, fill = bin)) +geom_bar(color ="gray30", width =1) +scale_x_discrete(drop =FALSE) +scale_fill_manual(values = pal6, drop =FALSE, guide ="none") +labs(title ="Distribution of % Bachelor's (Adults 25+)",x ="Bin (same as map)",y ="Number of tracts" ) +theme(axis.text.x =element_text( hjust =1))
I observe that most tracts in Westchester have under 40% of adults 25+ who have a bachelor’s degree, with the largest number of tracts having between 20-40% of adults with a bachelor’s degree (right-skew). The map corroborates this as well with the majority of the map being the shade of blue described by 20-40% here in this histogram as well. There are no tracts with 80-100% (and very view with 60-80%) showing us that the high extreme of nearly all adults with a bachelors degree isn’t represented in any tract within Westchester County.
Draw two more choropleths with the same data, changing the fill scale breaks from what you used in part b). Be sure that your graphs convey the scale break method in as clear a way as possible.
#1) Quantile breaks (≈ same number of tracts per class)tm_shape(bach_pct) +tm_polygons(fill ="pct_bach25",col ="gray40", lwd =0.1,fill.scale =tm_scale_intervals(style ="quantile", n =5,values ="brewer.blues" ),fill.legend =tm_legend(title ="Adults 25+ with Bachelor's (%)") ) +tm_title("Westchester % Bachelor's Degrees (Adults 25+)\nQuantile Breaks (5 classes)")
#2) K-Means (clusters tracts with similar % bachelor's values)tmap_mode("plot")
Discuss the results of part d). What are advantages and disadvantages of the different methods?
Quantile breaks tries to ensure that every colored class has around the same number of tracts. This is great because every color appears equally often, making patterns easy to compare, especially in a graph with as many spatial tracts as here in Westchester County. we want to be able to make distinctions between the top and middle and bottom areas. However, this can be misleading if a viewer is not just trying to understand relative percentages. A tract may appear in the “top” color class but still have a small percentage of college graduates when viewed on the full 0–100% scale. We can also get that small differences by the cutoff determined by the quantile breaks makes a tract fall into one color or another exaggerating their differences.
In our Quantile map, we can see each color appears roughly the same number of times and is balanced across the map. However, the highest class (33-77%) is a very wide range to get the darkest color to have enough of a presence in the map, making actual high educational tracts on the top of that range not stand out. This means the map emphasizes relative ranking (top, middle, etc.) rather than actual differences in education levels.
On the other hand, K-Means partitions tracts into groups of similar values and minimizing variance within a group so that we can emphasize clusters with similar values (% of people with a bachelors degree). Unlike quantile, we don’t have to get a certain number of tracts per color/class so we can see different variances in the number of each color assigned to tracts.
In our graph there is a reduction in the number of darkest blue tracts when using K-Means so we don’t have to include a certain number of those darkest counties. We also see few regions with the very lightest range where there is a small percentage of people with bachelors degrees. Though some color classes cover more tracts than others, this method better reflects the actual structure of the data and we can see there are lots of clusters falling into neither extreme, unlike how quantiles exaggerates the extremes.
Besides part d, we did linear coloring in earlier parts where I split the dataset into equal bins from 0-100%. This was good to use the entire range of percentages but also we noticed that for example the 80-100% bin was not utilized (with the largest % capping out under 80). Thus, while linear classification ensures consistency in interval size, it may fail to highlight differences effectively when data are skewed or concentrated in one range.
4. Utility Poles
[12 points]
Use sf and ggplot2
Obtain the coordinate locations of utility poles in Westchester County from https://gis.westchestergov.com/. Show the dimensions of the dataset.
poles =st_read('Utility_Poles/Utility_Poles.shp',quiet=TRUE)dim(poles)
[1] 187496 5
Plot the locations of the points. You do not need to show a background map.
ggplot(data = poles) +geom_sf(size =0.5, colour ="blue") +labs(title ="Utility Poles in Westchester County",x ="Longitude", y ="Latitude")
Create a density heatmap of the utility poles using the ggdensity package.
library(ggdensity)#library(viridis)poles_6539 <-st_transform(poles, 6539)poles_df <-cbind(poles_6539, st_coordinates(poles_6539))#IF HDR IS REQUIREDggplot(poles_df, aes(X, Y)) +geom_hdr(probs =c(0.001, 0.1, 0.2, 0.3, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.999)) +coord_sf() +scale_fill_viridis_d(labels =function(l) paste0(as.numeric(l) *100),option ="plasma" ) +labs(title ="Utility Pole Highest Density Regions in Westchester County",x ="Longitude", y ="Latitude" )
#MAKING IT LOOK BETTER WITH STAT DENSITY / SCALE FILL#decided to use the above method as per professor instruction# ggplot(poles_df, aes(X, Y)) +# stat_density_2d_filled(contour_var = "ndensity") +# scale_fill_viridis_d(option = "cividis")+# coord_equal() +# scale_fill_viridis_d(# option = "plasma"# ) +# labs(# title = "Utility Pole Highest Density Regions",# x = "Longitude", y = "Latitude"# )
Add municipal boundaries with labels.
municipal_6539 <-st_transform(muni, 6539)#muni_labels <- st_point_on_surface(municipal_6539)muni_labels <-suppressWarnings(st_centroid(municipal_6539))ggplot(poles_df, aes(X, Y)) +geom_hdr(probs =c(0.001, 0.1, 0.2, 0.3, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.999)) +geom_sf(data = muni, fill =NA, color ="black", linewidth =0.4, inherit.aes =FALSE) +geom_sf_text(data = muni, aes(label = NAME), color ="red", size =3, check_overlap =TRUE, inherit.aes =FALSE) +coord_sf() +scale_fill_viridis_d(labels =function(l) paste0(as.numeric(l) *100),option ="plasma" ) +labs(title ="Utility Pole Highest Density Regions with Municipal Boundaries in Westchester County",x ="Longitude", y ="Latitude" )
Describe the most prominent cluster of towns in terms of utility pole density.
The most prominent cluster of towns is in the southernmost portion of Westchester County, as the density map shows; where there is a large clustering of dark regions from the density heatmap overlapping several towns including Yonkers, Mount Vernon, Pelham Manor, and Bronxville. The density gradient decreases northward into more rural areas like Lewisboro and North Salem.